feat: add spawn circuit breaker, lifecycle logging, and heartbeat auto-recovery #198

vltbaudbot wants to merge 2 commits into `modem-dev:main`
Conversation
**Greptile Summary**

This PR adds three resilience features: a spawn circuit breaker (3-strike, 5-minute cooldown), worker lifecycle logging, and heartbeat auto-recovery.

Key findings:
Confidence Score: 3/5
Important Files Changed
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[agent_spawn called] --> B{Circuit open?}
  B -- Yes --> C[Log circuit_rejected\nReturn circuit_open error]
  B -- No --> D[Validate inputs]
  D -- Invalid --> E[Return validation error\nNo circuit trip]
  D -- Valid --> F[Log spawn_started\nRun tmux new-session]
  F -- tmux code ≠ 0 --> G[recordFailure\nLog spawn_failed stage=spawn]
  F -- tmux code = 0 --> H[waitForSessionReadiness]
  H -- aborted --> I[recordFailure\nLog spawn_failed stage=aborted]
  H -- timeout --> J[recordFailure\nLog spawn_failed stage=readiness]
  H -- ready --> K[recordSuccess\nLog spawn_success]
  G --> L{consecutiveFailures ≥ 3?}
  I --> L
  J --> L
  L -- Yes --> M[state = open]
  L -- No --> N[state unchanged]
  subgraph Heartbeat Auto-Recovery
    P[failures detected] --> Q{failure.name = bridge?}
    Q -- Yes --> R[Kill bridge tmux\nKill port 7890\nRun startup-pi.sh]
    R --> S{Startup script\nexists AND uuids?}
    S -- No --> T[⚠️ No action logged\nBridge remains dead]
    S -- Yes --> U[Verify bridge\nLog RecoveryAction]
    Q -- No --> V{failure.name\nstartsWith orphan:?}
    V -- Yes --> W[Kill tmux session\nRemove alias\nLog success]
  end
  U --> X{All recovered?}
  W --> X
  X -- Yes --> Y[No LLM tokens consumed]
  X -- No --> Z[Inject heartbeat prompt\nwith recovery details]
```
Last reviewed commit: 760d1e3
`pi/extensions/heartbeat.ts` (outdated diff)
```typescript
// Kill existing bridge tmux session
try {
  execSync('tmux kill-session -t baudbot-gateway-bridge 2>/dev/null', { timeout: 5000 });
} catch {
  // May not exist — that's fine
}

// Kill anything holding port 7890
try {
  execSync('lsof -ti :7890 2>/dev/null | xargs kill -9 2>/dev/null', { timeout: 5000 });
} catch {
  // Nothing holding port — fine
}

// Restart via startup script
const startupScript = join(homedir(), ".pi", "agent", "skills", "control-agent", "startup-pi.sh");
if (existsSync(startupScript)) {
  // Get live session UUIDs from session-control dir
  const sockFiles = readdirSync(SOCKET_DIR).filter((f) => f.endsWith(".sock"));
  const uuids = sockFiles.map((f) => f.replace(".sock", "")).join(" ");
  if (uuids) {
    execSync(`bash "${startupScript}" ${uuids} 2>&1`, {
      timeout: 30000,
      encoding: "utf-8",
    });

    // Verify bridge came back
    await new Promise((resolve) => setTimeout(resolve, 3000));
    const verifyResult = await checkBridge();

    const entry: RecoveryAction = {
      timestamp: new Date().toISOString(),
      check: failure.name,
      action: "bridge_restart",
      success: verifyResult.ok,
      detail: verifyResult.ok
        ? "Bridge restarted and verified healthy"
        : `Bridge restart attempted but still failing: ${verifyResult.detail}`,
    };
    actions.push(entry);
    logRecovery(entry);
  }
```
**Bridge killed before verifying restart is possible**

The bridge session and port 7890 are killed *before* verifying that either the startup script exists or that there are live socket UUIDs to pass it. If `startupScript` doesn't exist, or `sockFiles` is empty (no `.sock` files), the code falls through both inner `if` guards without pushing any `RecoveryAction` entry and without logging anything. The bridge is now dead but `recoveryActions` contains no entry for `"bridge"`, so:

1. `recoveredChecks` will NOT contain `"bridge"`.
2. The bridge failure remains in `remainingFailures` — the agent is prompted.
3. But the agent has no idea the previous bridge tmux session was already killed and port 7890 was cleared — it's working against a ghost.
A minimal fix is to push a "partial failure" action in the missing branches:

```typescript
// After killing bridge/port but before the startup script check:
if (!existsSync(startupScript)) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: `Bridge session killed but startup script not found at ${startupScript} — cannot restart`,
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}

// …and similarly if uuids is empty:
if (!uuids) {
  const entry: RecoveryAction = {
    timestamp: new Date().toISOString(),
    check: failure.name,
    action: "bridge_restart",
    success: false,
    detail: "Bridge session killed but no live socket UUIDs found — cannot restart",
  };
  actions.push(entry);
  logRecovery(entry);
  continue;
}
```

This ensures the agent prompt always includes the full picture of what auto-recovery attempted.
`pi/extensions/heartbeat.ts` (outdated diff)
```typescript
try {
  // Kill the tmux session
  try {
    execSync(`tmux kill-session -t "${sessionName}" 2>/dev/null`, { timeout: 5000 });
```
**Shell injection via unsanitized session name from filesystem**

`sessionName` is derived directly from filenames in `SOCKET_DIR` (e.g. a file named `dev-agent-foo"; rm -rf ~; echo ".alias`). While the directory is inside `~/.pi/`, the shell interpolation via template literal in `execSync` would execute arbitrary injected commands if such a filename existed.
Prefer the array-argument form of `spawnSync` / `execFileSync`, which bypasses the shell entirely:

```typescript
import { execFileSync } from "node:child_process";
// …
execFileSync("tmux", ["kill-session", "-t", sessionName], { timeout: 5000 });
```

Similarly on line 635 for the startup script invocation — the UUIDs should each be passed as separate arguments, not space-joined into a single string.
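For the startup-script call, the same no-shell approach can be sketched as follows. This is a minimal sketch, not code from this PR: `startupArgs` and `restartBridge` are hypothetical names, and the argv-building step is factored out only to make the pattern explicit.

```typescript
import { execFileSync } from "node:child_process";

// Build the argv for the startup script: each UUID is its own
// argument, never joined into one shell-interpolated string.
function startupArgs(startupScript: string, uuids: string[]): string[] {
  return [startupScript, ...uuids];
}

// execFileSync bypasses the shell entirely, so hostile characters in
// a filename-derived UUID cannot inject commands.
function restartBridge(startupScript: string, uuids: string[]): string {
  return execFileSync("bash", startupArgs(startupScript, uuids), {
    timeout: 30_000,
    encoding: "utf-8",
  });
}
```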
`pi/extensions/agent-spawn.ts` (outdated diff)
```typescript
try {
  const { execSync } = require("node:child_process");
  const tail = execSync(`tail -20 "${LIFECYCLE_LOG_PATH}" 2>/dev/null`, { encoding: "utf-8" });
  if (tail.trim()) {
    const lines = tail.trim().split("\n");
```
**Dynamic `require()` in an ESM/TypeScript file — use `readFileSync` instead**

This file uses static ES module `import` declarations at the top, but `spawn_status` falls back to `require("node:child_process")` and a shell `tail` command to read the last 20 lines of the JSONL log. Two problems:

1. Dynamic `require()` is a CommonJS pattern and can be surprising in an ESM context.
2. Shelling out to `tail` for a simple file-tail operation is non-portable (unavailable on Windows, behaves differently under PATH changes) and redundant — `readFileSync` is already available as a static import in this file.
A portable alternative using the already-imported `readFileSync`:

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync, readlinkSync, statSync } from "node:fs";
// …
// Inside spawn_status execute():
let recentEvents = "";
try {
  if (existsSync(LIFECYCLE_LOG_PATH)) {
    const lines = readFileSync(LIFECYCLE_LOG_PATH, "utf-8")
      .trimEnd()
      .split("\n")
      .slice(-20);
    // … same parse/format logic
  }
} catch {
  recentEvents = " (no lifecycle log)";
}
```

The same `require("node:child_process")` / `require("node:fs")` pattern also appears in `heartbeat.ts` (lines 595, 684) where top-level static imports should be used instead.
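The elided "same parse/format logic" step might look like this. This is only a sketch: the field names (`timestamp`, `event`, `agent`) are assumptions based on the lifecycle events this PR describes, and `formatLifecycleLines` is a hypothetical helper name.

```typescript
// Hypothetical sketch: render JSONL lifecycle lines as a compact,
// human-readable summary, tolerating malformed lines.
function formatLifecycleLines(lines: string[]): string {
  return lines
    .map((line) => {
      try {
        const ev = JSON.parse(line) as { timestamp?: string; event?: string; agent?: string };
        return `  ${ev.timestamp ?? "?"} ${ev.event ?? "?"}${ev.agent ? ` (${ev.agent})` : ""}`;
      } catch {
        // A corrupt line should not break spawn_status output.
        return `  [unparseable] ${line.slice(0, 80)}`;
      }
    })
    .join("\n");
}
```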
…o-recovery

Addresses remaining items from modem-dev#185 (control-agent resilience):

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes (cooldown). This prevents resource waste when spawns are failing due to systemic issues (missing API keys, model unavailability, etc.). The circuit transitions: closed → open (after 3 failures) → half-open (after cooldown) → closed (on success).

Failure tracking counts tmux spawn failures, readiness timeouts, and aborted readiness checks. Validation errors (bad name, missing model) don't affect the circuit.

All spawn events are logged to ~/.pi/agent/logs/worker-lifecycle.jsonl:

- spawn_started: when a spawn attempt begins
- spawn_success: readiness verified (includes ready_after_ms)
- spawn_failed: tmux error, readiness timeout, or abort
- circuit_rejected: spawn refused by open circuit

New `spawn_status` tool exposes circuit breaker state and recent lifecycle events for observability.

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two failure types:

- Bridge down: kills existing bridge tmux session, clears port holders, runs startup-pi.sh, verifies bridge comes back
- Orphaned dev-agents: kills tmux session, removes stale alias

If recovery succeeds, no LLM tokens are consumed (same as healthy check). Only unrecoverable failures prompt the agent. Recovery actions are logged to ~/.pi/agent/logs/auto-recovery.jsonl for audit and debugging.

- Fixed test harness to handle multiple tool registrations
- Added circuit breaker test (3 failures → open → rejected)
- Added spawn_status tool registration test
- All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)

Refs modem-dev#185

Co-authored-by: Darcy Clarke <darcy@darcyclarke.me>
- Fix circuit breaker race: separate isCircuitOpen check from half-open state transition, call transitionToHalfOpen only after all input validation passes (Sentry)
- Fix orphan cleanup error swallowing: inspect tmux kill-session errors, only treat 'session not found' as success, report real failures (Sentry)
- Fix bridge recovery gaps: add explicit failure actions when startup script is missing or no live socket UUIDs found (Greptile)
- Fix shell injection: use execFileSync for tmux commands instead of execSync with string interpolation (Greptile)
- Replace all dynamic require() calls with static ES module imports in both agent-spawn.ts and heartbeat.ts (Greptile)
- Replace shell tail command with readFileSync for lifecycle log reading in spawn_status tool (Greptile)
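The orphan-cleanup fix in the second bullet can be sketched like this. The helper names are hypothetical, and the exact tmux error strings matched are an assumption (tmux reports "can't find session" for a missing session); the real code lives in `heartbeat.ts`.

```typescript
// Sketch: decide whether a tmux kill-session error is benign.
// Only a missing session counts as success; anything else is a
// real failure that must be reported, not swallowed.
function isBenignKillError(stderr: string): boolean {
  return stderr.includes("can't find session") || stderr.includes("session not found");
}

function classifyKillResult(stderr: string): { ok: boolean; detail: string } {
  return isBenignKillError(stderr)
    ? { ok: true, detail: "session already gone" }
    : { ok: false, detail: stderr };
}
```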
932bbd7 to cafb91b (force-pushed)
Summary
Addresses remaining resilience items from #185 — the three highest-value improvements that don't require pi core changes.
### 1. Spawn Circuit Breaker (`agent-spawn.ts`)

After 3 consecutive spawn failures, the circuit opens and rejects new spawn attempts for 5 minutes. This prevents resource waste when spawns are failing systemically (missing API keys, model unavailability, etc.).
Failure types that trip the circuit:

- tmux spawn failure (non-zero exit)
- readiness timeout
- aborted readiness check

Validation errors (bad name, missing model, etc.) do NOT trip the circuit.
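The closed → open → half-open → closed transitions can be sketched as a small state machine. This is a sketch with hypothetical names, not the PR's actual code; per the follow-up commit, the open-check is kept separate from the half-open transition, which is applied only after input validation passes.

```typescript
type CircuitState = "closed" | "open" | "half-open";

const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 5 * 60 * 1000;

class SpawnCircuitBreaker {
  private state: CircuitState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  // Pure check: true only while the cooldown is still running.
  isOpen(now = Date.now()): boolean {
    return this.state === "open" && now - this.openedAt < COOLDOWN_MS;
  }

  // Separate transition, called after validation (avoids the race of
  // mutating state inside the check itself).
  transitionToHalfOpen(now = Date.now()): void {
    if (this.state === "open" && now - this.openedAt >= COOLDOWN_MS) {
      this.state = "half-open";
    }
  }

  recordFailure(now = Date.now()): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= FAILURE_THRESHOLD) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }
}
```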
### 2. Worker Lifecycle Logging (`agent-spawn.ts`)

All spawn events are appended to `~/.pi/agent/logs/worker-lifecycle.jsonl`:

- `spawn_started`: a spawn attempt begins
- `spawn_success`: readiness verified (includes `ready_after_ms`)
- `spawn_failed`: tmux error, readiness timeout, or abort
- `circuit_rejected`: spawn refused by an open circuit
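An append-only JSONL log like this is typically one write per event. A minimal sketch follows; `formatSpawnEvent`/`logSpawnEvent` are hypothetical names, and any field beyond `timestamp`, `event`, and `ready_after_ms` is an assumption.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { dirname, join } from "node:path";
import { homedir } from "node:os";

// Path taken from the PR description.
const LIFECYCLE_LOG_PATH = join(homedir(), ".pi", "agent", "logs", "worker-lifecycle.jsonl");

type SpawnEvent = "spawn_started" | "spawn_success" | "spawn_failed" | "circuit_rejected";

// Serialize one event as a single JSONL line; extra fields
// (e.g. ready_after_ms) are merged into the object.
function formatSpawnEvent(event: SpawnEvent, fields: Record<string, unknown> = {}): string {
  return JSON.stringify({ timestamp: new Date().toISOString(), event, ...fields }) + "\n";
}

// Append-only: one JSON object per line, easy to tail and to parse.
function logSpawnEvent(event: SpawnEvent, fields: Record<string, unknown> = {}): void {
  mkdirSync(dirname(LIFECYCLE_LOG_PATH), { recursive: true });
  appendFileSync(LIFECYCLE_LOG_PATH, formatSpawnEvent(event, fields));
}
```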
A new `spawn_status` tool exposes circuit breaker state and recent lifecycle events for observability.

### 3. Heartbeat Auto-Recovery (`heartbeat.ts`)

Before prompting the control-agent about failures, the heartbeat now attempts automatic recovery for two safe, idempotent failure types:

- Bridge down: kill the existing bridge tmux session, clear port holders, run `startup-pi.sh`, verify the bridge comes back
- Orphaned dev-agents: kill the tmux session, remove the stale alias
If recovery succeeds, no LLM tokens are consumed — same as a healthy check. Only unrecoverable failures prompt the agent.
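The two recovery paths plus the prompt fallback can be sketched as a dispatch. Hypothetical names throughout; the check names `bridge` and the `orphan:` prefix follow the flowchart in the review summary.

```typescript
// Sketch of the recovery dispatch: bridge failures get a restart
// attempt, orphaned dev-agents get their session cleaned up, and
// everything else falls through to the (token-consuming) agent prompt.
type Failure = { name: string; detail: string };

function classifyRecovery(failure: Failure): "bridge_restart" | "orphan_cleanup" | "prompt_agent" {
  if (failure.name === "bridge") return "bridge_restart";
  if (failure.name.startsWith("orphan:")) return "orphan_cleanup";
  return "prompt_agent"; // unrecoverable: only these reach the LLM
}
```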
Recovery actions are logged to `~/.pi/agent/logs/auto-recovery.jsonl`.

### What this addresses from #185

Spawn resilience for `agent_spawn` (circuit breaker, lifecycle logging, `spawn_status` observability) and heartbeat auto-recovery for bridge and orphan failures.

### Testing
- Fixed the test harness to handle multiple tool registrations
- Added a circuit breaker test (3 failures → open → rejected)
- Added a `spawn_status` tool registration test
- All 128 tests pass (73 heartbeat + 6 agent-spawn + 49 memory)
- `tsc --noEmit` ✅
- `biome lint` ✅

### Files changed
- `pi/extensions/agent-spawn.ts`: circuit breaker, lifecycle logging, `spawn_status` tool
- `pi/extensions/heartbeat.ts`: heartbeat auto-recovery
- `pi/extensions/agent-spawn.test.mjs`: circuit breaker and `spawn_status` tests

Refs #185